Efficient Transparent Optimistic Rollback Recovery for Distributed Application Programs
Abstract
Existing rollback-recovery methods using consistent checkpointing may cause high overhead for applications that frequently send output to the “outside world,” since a new consistent checkpoint must be written before the output can be committed, whereas existing methods using optimistic message logging may cause large delays in committing output, since processes may buffer received messages arbitrarily long before logging and may also delay propagating knowledge of their logging or checkpointing progress to other processes. This paper describes a new transparent rollback-recovery method that adds very little overhead to distributed application programs and efficiently supports the quick commit of all output to the outside world. Each process can independently choose at any time either to use checkpointing alone (as in consistent checkpointing) or to use optimistic message logging. The system is based on a new commit algorithm that requires communication with and information about the minimum number of other processes in the system, and supports the recovery of both deterministic and nondeterministic processes.
Similar resources
Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, nam...
An Application-Transparent, Platform-Independent Approach to Rollback-Recovery for Mobile Agent Systems
This paper proposes a new approach to rollback-recovery for mobile-agent systems and describes its implementation in the MESSENGERS mobile agents system. The checkpointing method used makes it possible to implement space- and time-efficient, user-transparent rollback-recovery in heterogeneous distributed environments. Together with an efficient non-blocking system snapshot algorithm this checkpointing met...
Completely Asynchronous Optimistic Recovery with Minimal Rollbacks
Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on optimistic message logging and replay is desirable for several reasons, including not requiring synchronization between processes during failure-free operation. However, previous optimistic rollback recovery protocols either have required synchroni...
Asynchronous Optimistic Rollback Recovery Using Secure Distributed Time
In an asynchronous distributed computation, processes may fail and restart from saved state. A protocol for optimistic rollback recovery must recover the system when other processes may depend on lost states at failed processes. Previous work has used forms of partial order clocks to track potential causality. Our research addresses two crucial shortcomings: the rollback problem also involves t...
Implementation and Performance of Transparent Rollback-recovery in Manetho
We describe the implementation and performance of rollback-recovery in Manetho. During failure-free operation, Manetho maintains an antecedence graph which records the “happened before” relation between certain events in the distributed computation. The antecedence graph is used in combination with checkpointing and volatile sender-based message logging to simultaneously achieve low failure-fre...